Overview

Give a brief description of your project and its goal(s), the data you are using to complete it, and the three faculty/staff in different fields you have spoken to about your project, with a brief summary of what you learned from each person. Include a link to your final project GitHub repository.

This project considers the burdens of poor air quality and its health consequences, and how they fall on different American communities. To explore these relationships, I use daily site-level monitoring data for fine particulate matter (PM2.5) from the EPA’s Air Quality System for 2019, linked with census data for the tract surrounding each testing site. I am still determining how best to incorporate data on the near-term health effects of poor air quality. My main goal, however, is not to use air quality to predict those health outcomes, but to determine whether community characteristics predict air quality differently (e.g., in extent or significance) than those same characteristics predict the prevalence of events like emergency department presentations for asthma exacerbation. In developing a plan for this project I met with Drs. Anil Vachani, Gary Weissman, and John P. Reilly, who provided insights on the causal pathway from pollution sources to specific EPA-measured pollutants to short- and long-term health outcomes; on potential data sources and analytic approaches; on the ways “natural experiments” related to major changes in pollutant generation have been used to compare the effects of perturbation from a steady state; and more. Materials for this project can be found in this GitHub repository.

Introduction

Describe the problem addressed, its significance, and some background to motivate the problem. Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff.

Poor air quality can substantially harm health and quality of life, but people differ in the resources, incentives, and tradeoffs they weigh when deciding where to live. We might therefore expect the relationship between who lives where and air quality to be more complex than simply that people with more wealth, income, and social power concentrate in places with better air quality. By considering many different community characteristics in building models to predict poor air quality, it is possible to learn which factors are most strongly associated with air quality, and how well models trained on community characteristics can predict poor air quality and related health events.

Methods

Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.

I begin by defining a function to create a data frame from the results of a GET call to the EPA’s Air Quality System API. I create objects with relevant reference lists and define a vector for the “lower 48” U.S. states and Washington, D.C. for later use.

# Retrieve environment variables with EPA AQS API credentials
api.email <- Sys.getenv("EPA_AQS_EMAIL")
api.key   <- Sys.getenv("EPA_AQS_KEY")

# Function to create a data frame based on an API call
download.AQS <- function(whichData, customParams) { 
  rootpath <- "https://aqs.epa.gov/data/api/"
  morepath <- paste0(rootpath, whichData)
  fullParams <- append(list(email = api.email, key = api.key), customParams)
  step1.json <- httr::GET(url = morepath, query = fullParams)
  httr::stop_for_status(step1.json)  # fail early on HTTP errors
  step2.parsed <- jsonlite::fromJSON(httr::content(step1.json, "text"), simplifyVector = FALSE)
  step3.df <- tibble(Data = step2.parsed$Data) %>% unnest_wider(Data)
  return(step3.df)
}

# Fetch documentation for daily data summaries, state code list
dailyData.dictionary <- download.AQS("metaData/fieldsByService",list(service = "dailyData"))
list.states <- download.AQS("list/states",list())

# Omitting state codes other than 48 states + D.C.
list.states.lower48 <- list.states %>% filter(!(code %in% c("02","15","66","72","78","80","CC")))
list.states.lower48
## # A tibble: 49 x 2
##    code  value_represented   
##    <chr> <chr>               
##  1 01    Alabama             
##  2 04    Arizona             
##  3 05    Arkansas            
##  4 06    California          
##  5 08    Colorado            
##  6 09    Connecticut         
##  7 10    Delaware            
##  8 11    District Of Columbia
##  9 12    Florida             
## 10 13    Georgia             
## # … with 39 more rows
lower48.codes <- dplyr::pull(list.states.lower48, code)
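The unnesting step inside download.AQS can be illustrated without touching the API: the AQS service returns each row as a list element under $Data, which unnest_wider spreads into columns. A minimal sketch with a hand-built parsed object (the field names here are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for jsonlite::fromJSON(..., simplifyVector = FALSE) output
parsed <- list(Data = list(list(code = "01", value = 2.9),
                           list(code = "04", value = 3.1)))

# Same pattern as in download.AQS(): one row per list element, one column per field
demo.df <- tibble(Data = parsed$Data) %>% unnest_wider(Data)
demo.df  # a 2 x 2 tibble with columns code and value
```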

To manage downloading daily PM2.5 data for each state for all of 2019, I define another function to loop through the state FIPS codes for given date parameters. Executing this step with 1 call per month of data takes roughly 90 minutes, so I save the resulting data frame and load it in the chunk below.

# Pull daily PM2.5 (parameter 88101) data for these states, with arguments for date range
get.daily.lower48 <- function(b, e){
  collector <- data.frame()
  for (s in seq_along(lower48.codes)) {
    # Pass the plain FIPS code string; a one-cell tibble slice is fragile as a query parameter
    newstate <- download.AQS("dailyData/byState",
                             list(param = "88101", bdate = b, edate = e, state = lower48.codes[s]))
    collector <- bind_rows(collector, newstate)
  }
  return(collector)
}

# Run in monthly batches
dailyPM2.5_201901 <- get.daily.lower48("20190101","20190131")
dailyPM2.5_201902 <- get.daily.lower48("20190201","20190228")
dailyPM2.5_201903 <- get.daily.lower48("20190301","20190331")
dailyPM2.5_201904 <- get.daily.lower48("20190401","20190430")
dailyPM2.5_201905 <- get.daily.lower48("20190501","20190531")
dailyPM2.5_201906 <- get.daily.lower48("20190601","20190630")
dailyPM2.5_201907 <- get.daily.lower48("20190701","20190731")
dailyPM2.5_201908 <- get.daily.lower48("20190801","20190831")
dailyPM2.5_201909 <- get.daily.lower48("20190901","20190930")
dailyPM2.5_201910 <- get.daily.lower48("20191001","20191031")
dailyPM2.5_201911 <- get.daily.lower48("20191101","20191130")
dailyPM2.5_201912 <- get.daily.lower48("20191201","20191231")
 
# Concatenate monthly batch files into a single file for 2019
all.2019.dailyPM2.5 <- bind_rows( dailyPM2.5_201901
                                 ,dailyPM2.5_201902
                                 ,dailyPM2.5_201903
                                 ,dailyPM2.5_201904
                                 ,dailyPM2.5_201905
                                 ,dailyPM2.5_201906
                                 ,dailyPM2.5_201907
                                 ,dailyPM2.5_201908
                                 ,dailyPM2.5_201909
                                 ,dailyPM2.5_201910
                                 ,dailyPM2.5_201911
                                 ,dailyPM2.5_201912)

# Save full 2019 raw file
saveRDS(all.2019.dailyPM2.5, file = "all2019dailyPM25.RDS")
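The twelve hand-written monthly calls above could also be generated programmatically. A sketch, assuming get.daily.lower48() as defined earlier (the API-calling line is left commented out since it takes ~90 minutes):

```r
# Build the twelve 2019 month ranges in the YYYYMMDD form the API expects
month.starts <- seq(as.Date("2019-01-01"), as.Date("2019-12-01"), by = "month")
month.ends   <- seq(as.Date("2019-02-01"), as.Date("2020-01-01"), by = "month") - 1
bdates <- format(month.starts, "%Y%m%d")
edates <- format(month.ends,   "%Y%m%d")

# One row-bound data frame from all twelve monthly batches:
# all.2019.dailyPM2.5 <- purrr::map2_dfr(bdates, edates, get.daily.lower48)
```

This also handles month lengths (including February) automatically rather than by hand.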

I test which combinations of testing-site location identifiers uniquely identify a site, then add a single site_id variable to the 2019 PM2.5 data set by concatenating the state, county, and site identifiers. I also create a data frame with one observation per testing site.

# Load the full 2019 raw file
all.2019.dailyPM2.5 <- readRDS("all2019dailyPM25.RDS")
str(all.2019.dailyPM2.5)
## 'data.frame':    1448813 obs. of  31 variables:
##  $ state_code         : chr  "01" "01" "01" "01" ...
##  $ county_code        : chr  "027" "027" "027" "027" ...
##  $ site_number        : chr  "0001" "0001" "0001" "0001" ...
##  $ parameter_code     : chr  "88101" "88101" "88101" "88101" ...
##  $ poc                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ latitude           : num  33.3 33.3 33.3 33.3 33.3 ...
##  $ longitude          : num  -85.8 -85.8 -85.8 -85.8 -85.8 ...
##  $ datum              : chr  "NAD83" "NAD83" "NAD83" "NAD83" ...
##  $ parameter          : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
##  $ sample_duration    : chr  "24 HOUR" "24 HOUR" "24 HOUR" "24 HOUR" ...
##  $ pollutant_standard : chr  "PM25 24-hour 2006" "PM25 Annual 2006" "PM25 24-hour 2012" "PM25 Annual 2012" ...
##  $ date_local         : chr  "2019-01-03" "2019-01-03" "2019-01-03" "2019-01-03" ...
##  $ units_of_measure   : chr  "Micrograms/cubic meter (LC)" "Micrograms/cubic meter (LC)" "Micrograms/cubic meter (LC)" "Micrograms/cubic meter (LC)" ...
##  $ event_type         : chr  "None" "None" "None" "None" ...
##  $ observation_count  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ observation_percent: num  100 100 100 100 100 100 100 100 100 100 ...
##  $ validity_indicator : chr  "Y" "Y" "Y" "Y" ...
##  $ arithmetic_mean    : num  2.9 2.9 2.9 2.9 2.7 2.7 2.7 2.7 6.6 6.6 ...
##  $ first_max_value    : num  2.9 2.9 2.9 2.9 2.7 2.7 2.7 2.7 6.6 6.6 ...
##  $ first_max_hour     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ aqi                : int  12 12 12 12 11 11 11 11 28 28 ...
##  $ method_code        : chr  "145" "145" "145" "145" ...
##  $ method             : chr  "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" ...
##  $ local_site_name    : chr  "ASHLAND" "ASHLAND" "ASHLAND" "ASHLAND" ...
##  $ site_address       : chr  "ASHLAND AIRPORT" "ASHLAND AIRPORT" "ASHLAND AIRPORT" "ASHLAND AIRPORT" ...
##  $ state              : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ county             : chr  "Clay" "Clay" "Clay" "Clay" ...
##  $ city               : chr  "Ashland" "Ashland" "Ashland" "Ashland" ...
##  $ date_of_last_change: chr  "2020-03-10" "2020-03-10" "2020-03-10" "2020-03-10" ...
##  $ cbsa_code          : chr  NA NA NA NA ...
##  $ cbsa               : chr  NA NA NA NA ...
# Create a file of all unique combinations of geographic indicators for testing sites
testing.sites <- all.2019.dailyPM2.5 %>%
                      dplyr::select(state_code, state, 
                             county_code, county,
                             city, cbsa_code, cbsa,
                             site_number, local_site_name, site_address,
                             latitude, longitude) %>%
                      unique() %>%
                      arrange(state_code,county_code,site_number)
dim(testing.sites)
## [1] 965  12
# 965 combinations

# Create a single unique ID column for site
testing.sites %>% dplyr::select(site_number) %>% unique() %>% dim()
## [1] 253   1
testing.sites %>% dplyr::select(state_code,site_number) %>% unique() %>% dim()
## [1] 770   2
testing.sites %>% dplyr::select(state_code,county_code,site_number) %>% unique() %>% dim()
## [1] 965   3
# Only 253 unique site numbers. 770 state-site combos. 
# State + county + site number *is* unique (965 combos)

clean.2019.dailyPM2.5 <- all.2019.dailyPM2.5 %>%
  mutate(site_id = paste0(state_code,county_code,site_number))
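The resulting site_id is a nine-character string whose first five characters (state + county FIPS) match the prefix of the eleven-character tract GEOID used later in the spatial merge. A quick check using the Ashland, AL site from the data above:

```r
# state_code + county_code + site_number, as in the mutate() call
site_id <- paste0("01", "027", "0001")
site_id                 # "010270001"
nchar(site_id)          # 9 characters
substr(site_id, 1, 5)   # "01027": the state+county prefix shared with tract GEOIDs
```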

Using the tigris::tracts function, I retrieve the geometries, GEOIDs, and other spatial information for census tracts and join each AQS testing site to the data for the tract within which it lies, based on the latitude and longitude from AQS. I loop through the state FIPS codes one at a time and stack the results into a single crosswalk file. This step is time-consuming, so I save the resulting file and load it in the next step.

# Prepare for spatial merge. Create sf file from testing site coordinates
testing.sites.sf.points <- st_as_sf(testing.sites, coords = c('longitude','latitude'), crs = 4269)

# Download census tract sf shapefiles, one state at a time.
# Perform spatial joins one state at a time to avoid having to build the national tract shapefile.
for (s in seq_along(lower48.codes)) {
  newstate <- lower48.codes[s]
  newtracts <- tracts(newstate)
  newjoin <- st_join(newtracts, testing.sites.sf.points, join = st_contains, left=FALSE)
  if (s==1){
    testing.sites.tracts <- newjoin
  } else {
    testing.sites.tracts <- bind_rows(testing.sites.tracts,newjoin)
  }
}

# Save file with testing sites <> census tract
saveRDS(testing.sites.tracts, file = "testing_sites_tracts.RDS")
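The behavior of st_join(..., join = st_contains, left = FALSE), keeping a tract row only when it contains a point, can be checked on a toy geometry (coordinates invented for illustration):

```r
library(sf)

# One unit-square "tract" and two points, one inside the square and one outside
tract <- st_sf(GEOID = "toy",
               geometry = st_sfc(st_polygon(list(
                 rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))))
sites <- st_sf(site_id = c("inside", "outside"),
               geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2))))

# Inner join: the tract row is kept once per point it contains
joined <- st_join(tract, sites, join = st_contains, left = FALSE)
joined$site_id  # only "inside" survives
```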

From tidycensus I now retrieve the variable list for the 2018 5-year American Community Survey. After reviewing the concepts from those variables as well as the questionnaire, I create an object with a set of variables to test for associations with the AQS data.

testing.sites.tracts <- readRDS("testing_sites_tracts.RDS")
head(testing.sites.tracts)
## Simple feature collection with 6 features and 22 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.10077 ymin: 30.43231 xmax: -84.96303 ymax: 34.00093
## geographic CRS: NAD83
##      STATEFP COUNTYFP TRACTCE       GEOID   NAME            NAMELSAD MTFCC FUNCSTAT     ALAND  AWATER    INTPTLAT     INTPTLON state_code   state county_code     county        city cbsa_code                      cbsa site_number                   local_site_name                                             site_address                       geometry
## ...1      01      055  001600 01055001600     16     Census Tract 16 G5020        S  10464679 1540235 +33.9780622 -085.9775275         01 Alabama         055     Etowah     Gadsden     23460               Gadsden, AL        0010                GADSDEN C. COLLEGE                    1001 WALLACE DRIVE, GADSDEN, AL 35902 MULTIPOLYGON (((-85.99934 3...
## ...2      01      097  005300 01097005300     53     Census Tract 53 G5020        S   4987328  294542 +30.7789148 -088.0797620         01 Alabama         097     Mobile   Chickasaw     33660                Mobile, AL        0003                         CHICKASAW     Iroquois and Azalea, CHICKASAW, MOBILE CO.,  ALABAMA MULTIPOLYGON (((-88.10077 3...
## ...3      01      003  011102 01003011102 111.02 Census Tract 111.02 G5020        S  44204283  813845 +30.4799296 -087.8330813         01 Alabama         003    Baldwin    Fairhope     19300 Daphne-Fairhope-Foley, AL        0010                 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA MULTIPOLYGON (((-87.88641 3...
## ...4      01      101  002500 01101002500     25     Census Tract 25 G5020        S   4488073   37253 +32.4177882 -086.2722538         01 Alabama         101 Montgomery  Montgomery     33860            Montgomery, AL        1002                        MOMS, ADEM                  1350 COLISEUM BLVD, MONTGOMERY, ALABAMA MULTIPOLYGON (((-86.28735 3...
## ...5      01      027  959000 01027959000   9590   Census Tract 9590 G5020        S 179658298 1129596 +33.3093759 -085.8820883         01 Alabama         027       Clay     Ashland      <NA>                      <NA>        0001                           ASHLAND                                          ASHLAND AIRPORT MULTIPOLYGON (((-86.01156 3...
## ...6      01      113  030800 01113030800    308    Census Tract 308 G5020        S  21342370  417245 +32.4345151 -084.9805005         01 Alabama         113    Russell Phenix City     17980           Columbus, GA-AL        0003 PHENIX CITY - SOUTH GIRARD SCHOOL         510 6th Place South, Phenix City, Alabama  36869 MULTIPOLYGON (((-85.01535 3...
# Load a data frame with all 5-year ACS variables
census.5yr.vars <- load_variables(dataset = "acs5", year = 2018)
head(census.5yr.vars)
## # A tibble: 6 x 3
##   name       label                                concept                                  
##   <chr>      <chr>                                <chr>                                    
## 1 B00001_001 Estimate!!Total                      UNWEIGHTED SAMPLE COUNT OF THE POPULATION
## 2 B00002_001 Estimate!!Total                      UNWEIGHTED SAMPLE HOUSING UNITS          
## 3 B01001_001 Estimate!!Total                      SEX BY AGE                               
## 4 B01001_002 Estimate!!Total!!Male                SEX BY AGE                               
## 5 B01001_003 Estimate!!Total!!Male!!Under 5 years SEX BY AGE                               
## 6 B01001_004 Estimate!!Total!!Male!!5 to 9 years  SEX BY AGE
# Simplify the variable list to make it easier to review the major concepts represented
census.5yr.concept.counts <- 
  census.5yr.vars %>% 
  mutate(root.concept = str_split_fixed(concept," BY",2)[,1]) %>%
  mutate(root.concept = str_split_fixed(root.concept," \\(",2)[,1]) %>% 
      # Double escape needed to match open parenthesis!
  mutate(root.name = substring(name,1,6)) %>%
  group_by(root.name,root.concept) %>%
  summarise(obs=n(), .groups="keep") %>%
  arrange(root.name,-obs)

# View(sort(unique(census.5yr.concept.counts$root.concept)))

# After reviewing the variable list, create an object with the ones of interest
final.acs.var.list <- census.5yr.vars %>% filter(name %in% c(
     "B01002_001" # median age
    ,"B03002_003","B03002_001" # Non-Hisp White alone // Denom
    ,"B02009_001","B02001_001" # Black race, alone or in combination // Denom
    ,"B06008_003","B06008_001" # Now married // Denom
    ,"B23007_002","B23007_001" # Children under 18yrs in household // Denom
    ,"B15003_022","B15003_023","B15003_024","B15003_025","B15003_001" # Bachelors // Masters // Professional // Doctorate // Denom
    ,"B23022_027","B23022_026","B23022_003","B23022_002" # Worked in past 12 months-Female // Denom-Female // p12-Male // Denom-Male
    ,"B21001_002","B21001_001" # Veteran // Denom
    ,"B17001_002","B17001_001" # Past 12mo income at or below poverty level, Denom
    ,"B22010_002","B22010_001" # Received SNAP in past 12mo, Denom
    ,"B22008_001"  # Median household income, past 12mo
    ,"B25008_003","B25008_001" # Renter occupied, Denom
    ,"B25024_002","B25024_003","B25024_001" # Single-fam, detached // single-fam, attached // Denom
    ,"B25064_001" # Median gross rent
    ,"B25088_002" # Median monthly owner costs for households with a mortgage
    ,"B08126_004","B08126_002","B08126_001" # INDUSTRY: Manufacturing // Agriculture, fishing, mining // Denom
    ,"B08006_003","B08006_001" # Drove to work alone // Denom
    ,"B08013_001" # Aggregate travel time to work in minutes
  )) %>% arrange(name)

final.acs.var.list
## # A tibble: 38 x 3
##    name       label                                                concept                                                                       
##    <chr>      <chr>                                                <chr>                                                                         
##  1 B01002_001 Estimate!!Median age --!!Total                       MEDIAN AGE BY SEX                                                             
##  2 B02001_001 Estimate!!Total                                      RACE                                                                          
##  3 B02009_001 Estimate!!Total                                      BLACK OR AFRICAN AMERICAN ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES
##  4 B03002_001 Estimate!!Total                                      HISPANIC OR LATINO ORIGIN BY RACE                                             
##  5 B03002_003 Estimate!!Total!!Not Hispanic or Latino!!White alone HISPANIC OR LATINO ORIGIN BY RACE                                             
##  6 B06008_001 Estimate!!Total                                      PLACE OF BIRTH BY MARITAL STATUS IN THE UNITED STATES                         
##  7 B06008_003 Estimate!!Total!!Now married, except separated       PLACE OF BIRTH BY MARITAL STATUS IN THE UNITED STATES                         
##  8 B08006_001 Estimate!!Total                                      SEX OF WORKERS BY MEANS OF TRANSPORTATION TO WORK                             
##  9 B08006_003 Estimate!!Total!!Car, truck, or van!!Drove alone     SEX OF WORKERS BY MEANS OF TRANSPORTATION TO WORK                             
## 10 B08013_001 Estimate!!Aggregate travel time to work (in minutes) AGGREGATE TRAVEL TIME TO WORK (IN MINUTES) OF WORKERS BY SEX                  
## # … with 28 more rows

Once more I loop through the state FIPS codes, retrieving the ACS variables of interest by tract as a flat file using tidycensus::get_acs and preserving the data only for the census tracts that contain AQS testing sites. As this is another slow step, I save the resulting data frame.

head(testing.sites.tracts)

# Keep only what is needed to make the ACS calls in a small data frame
tst.minimal <- testing.sites.tracts %>% 
                mutate(site_id = paste0(state_code,county_code,site_number)) %>%
                dplyr::select(site_id, GEOID, STATEFP, COUNTYFP)
st_geometry(tst.minimal) <- NULL

# Loop through existing list of state codes
lower48.codes

# Make ACS calls one state at a time, only for counties with AQS testing sites
# Preserve estimate columns, ditch MOE
# Join to testing site list
# Stack ACS data together in a single data frame

acs.loop <- function(){
  for (st in seq_along(lower48.codes)) {
    
    newstate <- lower48.codes[st]
    
    # Unique county codes for this state (several sites can share a county)
    county.list <- tst.minimal %>% 
                  filter(STATEFP==newstate) %>% 
                  distinct(COUNTYFP) %>%
                  arrange(COUNTYFP) %>%
                  pull(COUNTYFP)
    
    results.acs <- get_acs( geography = "tract"
                          ,variables=as.list(final.acs.var.list$name)
                          ,state = newstate
                          ,county = county.list
                          ,output = "wide")
    
    smaller.acs <- results.acs %>% dplyr::select(GEOID,matches(".*_.*E$"))
  
    linked.acs <- tst.minimal %>% 
                filter(STATEFP==newstate) %>%
                left_join(.,smaller.acs,by="GEOID")
    
    if (st==1){
      collect <- linked.acs
    } else {
      collect <- bind_rows(collect,linked.acs)
    }
  }
  return(collect)
}

testing.sites.acs <- acs.loop()
testing.sites.acs

saveRDS(testing.sites.acs, file = "testing_sites_acs.RDS")
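The matches(".*_.*E$") selector inside the loop keeps ACS estimate columns while dropping the margin-of-error (…M) columns and identifiers; the pattern can be sanity-checked directly:

```r
# Representative column names from a wide get_acs() result
cols <- c("GEOID", "NAME", "B01002_001E", "B01002_001M")

# The regex requires an underscore followed later by a terminal "E"
grepl(".*_.*E$", cols)
# FALSE FALSE TRUE FALSE: only the estimate column matches
```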

With all the ACS data I need, I can create clean versions of the variables to be included in the analysis. I exclude tracts that lack complete data for all 19 ACS-based variables or whose population denominator (taken from the race table) is below 500.

testing.sites.acs <- readRDS("testing_sites_acs.RDS")
head(testing.sites.acs)
##     site_id       GEOID STATEFP COUNTYFP B01002_001E B02001_001E B02009_001E B03002_001E B03002_003E B06008_001E B06008_003E B08006_001E B08006_003E B08013_001E B08126_001E B08126_002E B08126_004E B15003_001E B15003_022E B15003_023E B15003_024E B15003_025E B17001_001E B17001_002E B21001_001E B21001_002E B22008_001E B22010_001E B22010_002E B23007_001E B23007_002E B23022_002E B23022_003E
## 1 010550010 01055001600      01      055        37.1        3300        2264        3300        1011        2838         800        1303        1127       23920        1303          23         448        2295         219         110          60          25        3173         910        2724         273       36250        1232         156         653         200        1094         710
## 2 010970003 01097005300      01      097        34.2        1970         746        1970        1129        1410         472         639         564       12990         639           0          88        1205         120          60          14           0        1970         541        1319         120       32829         756         173         457         225         440         303
## 3 010030010 01003011102      01      003        43.0        4385          95        4385        3907        3479        2120        1953        1637       46550        1953           0         186        2936         762         260         117           8        4385         317        3215         343       64851        1633         104        1219         539        1191         907
## 4 011011002 01101002500      01      101        35.3        2317        1538        2317         500        1753         584         825         765       16105         825           0         139        1455         176          91           6          11        2304         567        1647         108       34109         852         292         586         283         653         482
## 5 010270001 01027959000      01      027        47.1        2975         506        2975        2141        2485        1044        1066         969       21705        1066           0         337        1988         174          49          20          22        2814        1053        2417         179       26875        1240         216         778         247         792         530
## 6 011130003 01113030800      01      113        39.5        3254        3024        3254         212        2755         583        1103         941       25900        1103           0         188        2066          59          30           0           0        3159        1425        2566         249       21017        1329         381         602         155         944         544
##   B23022_026E B23022_027E B25008_001E B25008_003E B25024_001E B25024_002E B25024_003E B25064_001E B25088_002E
## 1        1170         735        3173         911        1650        1522          24         841         840
## 2         556         307        1970         895         896         627          46         598        1052
## 3        1409        1058        4385         328        1880        1555          19         937        1381
## 4         850         521        2317        1263        1122         969          42         780         856
## 5         971         602        2838         879        1569        1051           7         394         971
## 6        1258         734        3249        1509        1599        1126          21         604         944
# Create percentage variables and scaled medians/totals to use in modeling
testing.sites.acs.ready <- testing.sites.acs %>%
  dplyr::rename( age.median       = B01002_001E
                ,income.hh.median = B22008_001E 
                ,rent.median      = B25064_001E
                ,home.pmt.median  = B25088_002E
                ,commute.time.agg = B08013_001E) %>%
  mutate( race.black.any          = 100*B02009_001E/B02001_001E
         ,ethn.nhisp.white.alone  = 100*B03002_003E/B03002_001E
         ,married                 = 100*B06008_003E/B06008_001E
         ,kids                    = 100*B23007_002E/B23007_001E
         ,bachelor.plus           = 100*(B15003_022E + B15003_023E + B15003_024E + B15003_025E)/B15003_001E
         ,veteran                 = 100*B21001_002E/B21001_001E
         ,manufacturing           = 100*B08126_004E/B08126_001E
         ,farm.fish.mining        = 100*B08126_002E/B08126_001E
         ,commutes.car.alone      = 100*B08006_003E/B08006_001E
         ,renter                  = 100*B25008_003E/B25008_001E
         ,single.fam.home         = 100*(B25024_002E + B25024_003E)/B25024_001E
         ,worked.12mo             = 100*(B23022_027E + B23022_003E)/(B23022_026E + B23022_002E) 
         ,poverty.12mo            = 100*B17001_002E/B17001_001E
         ,snap.12mo               = 100*B22010_002E/B22010_001E
         ,denominator             = B02001_001E) %>%
  dplyr::select(-matches("^B[012].*"))

head(testing.sites.acs.ready)
##     site_id       GEOID STATEFP COUNTYFP age.median commute.time.agg income.hh.median rent.median home.pmt.median race.black.any ethn.nhisp.white.alone  married     kids bachelor.plus   veteran manufacturing farm.fish.mining commutes.car.alone    renter single.fam.home worked.12mo poverty.12mo snap.12mo denominator
## 1 010550010 01055001600      01      055       37.1            23920            36250         841             840      68.606061              30.636364 28.18887 30.62787     18.039216 10.022026      34.38219         1.765157           86.49271 28.710999        93.69697    63.82509     28.67948 12.662338        3300
## 2 010970003 01097005300      01      097       34.2            12990            32829         598            1052      37.868020              57.309645 33.47518 49.23414     16.099585  9.097801      13.77152         0.000000           88.26291 45.431472        75.11161    61.24498     27.46193 22.883598        1970
## 3 010030010 01003011102      01      003       43.0            46550            64851         937            1381       2.166477              89.099202 60.93705 44.21657     39.066757 10.668740       9.52381         0.000000           83.81976  7.480046        83.72340    75.57692      7.22919  6.368647        4385
## 4 011011002 01101002500      01      101       35.3            16105            34109         780             856      66.378938              21.579629 33.31432 48.29352     19.518900  6.557377      16.84848         0.000000           92.72727 54.510142        90.10695    66.73320     24.60938 34.272300        2317
## 5 010270001 01027959000      01      027       47.1            21705            26875         394             971      17.008403              71.966387 42.01207 31.74807     13.329980  7.405875      31.61351         0.000000           90.90056 30.972516        67.43149    64.20874     37.42004 17.419355        2975
## 6 011130003 01113030800      01      113       39.5            25900            21017         604             944      92.931776               6.515058 21.16152 25.74751      4.307841  9.703819      17.04442         0.000000           85.31278 46.445060        71.73233    58.03815     45.10921 28.668172        3254
summary(testing.sites.acs.ready)
##    site_id             GEOID             STATEFP            COUNTYFP           age.median    commute.time.agg income.hh.median  rent.median     home.pmt.median race.black.any   ethn.nhisp.white.alone    married            kids        bachelor.plus      veteran       manufacturing    farm.fish.mining  commutes.car.alone     renter       single.fam.home   worked.12mo      poverty.12mo    
##  Length:965         Length:965         Length:965         Length:965         Min.   :19.20   Min.   :  2175   Min.   :  6953   Min.   : 271.0   Min.   : 541    Min.   : 0.000   Min.   : 0.00          Min.   : 1.076   Min.   :  0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 3.144     Min.   :  0.00   Min.   :  0.00   Min.   : 9.004   Min.   : 0.6641  
##  Class :character   Class :character   Class :character   Class :character   1st Qu.:32.70   1st Qu.: 25162   1st Qu.: 35088   1st Qu.: 706.0   1st Qu.:1030    1st Qu.: 1.767   1st Qu.:36.90          1st Qu.:31.764   1st Qu.: 35.50   1st Qu.:13.25   1st Qu.: 4.688   1st Qu.: 5.119   1st Qu.: 0.0000   1st Qu.:69.001     1st Qu.: 24.17   1st Qu.: 51.07   1st Qu.:68.488   1st Qu.: 9.6620  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :37.20   Median : 37895   Median : 48700   Median : 863.0   Median :1282    Median : 5.738   Median :66.52          Median :42.238   Median : 42.71   Median :21.14   Median : 7.185   Median : 9.038   Median : 0.7683   Median :77.513     Median : 39.78   Median : 68.73   Median :75.509   Median :16.6063  
##                                                                              Mean   :37.75   Mean   : 45979   Mean   : 52509   Mean   : 942.3   Mean   :1412    Mean   :16.031   Mean   :59.65          Mean   :41.888   Mean   : 43.10   Mean   :25.23   Mean   : 7.422   Mean   :10.364   Mean   : 3.0371   Mean   :74.382     Mean   : 42.77   Mean   : 63.96   Mean   :74.135   Mean   :19.7553  
##                                                                              3rd Qu.:42.50   3rd Qu.: 58158   3rd Qu.: 64558   3rd Qu.:1075.0   3rd Qu.:1621    3rd Qu.:19.825   3rd Qu.:85.59          3rd Qu.:53.168   3rd Qu.: 50.56   3rd Qu.:33.17   3rd Qu.: 9.661   3rd Qu.:14.093   3rd Qu.: 2.8991   3rd Qu.:83.803     3rd Qu.: 58.57   3rd Qu.: 81.32   3rd Qu.:81.789   3rd Qu.:27.9066  
##                                                                              Max.   :70.00   Max.   :305225   Max.   :193239   Max.   :3017.0   Max.   :4001    Max.   :99.358   Max.   :98.92          Max.   :74.400   Max.   :100.00   Max.   :88.74   Max.   :34.728   Max.   :47.056   Max.   :65.6162   Max.   :95.970     Max.   :100.00   Max.   :100.00   Max.   :95.032   Max.   :73.4463  
##                                                                              NA's   :2       NA's   :7        NA's   :6        NA's   :12       NA's   :27      NA's   :2        NA's   :2              NA's   :2        NA's   :5        NA's   :2       NA's   :2        NA's   :5        NA's   :5         NA's   :5          NA's   :5        NA's   :5        NA's   :2        NA's   :5        
##    snap.12mo      denominator   
##  Min.   : 0.00   Min.   :    0  
##  1st Qu.: 6.49   1st Qu.: 2853  
##  Median :13.87   Median : 4145  
##  Mean   :17.09   Mean   : 4521  
##  3rd Qu.:24.60   3rd Qu.: 5646  
##  Max.   :72.19   Max.   :29636  
##  NA's   :5
# testing.sites.acs.ready %>% dplyr::filter(denominator < 1000) %>% View()
# testing.sites.acs.ready %>% dplyr::filter(denominator < 500 | !complete.cases(.)) %>% View()

length(unique(testing.sites.acs.ready$GEOID))
## [1] 953
# 953 unique tracts across 965 site rows - site_id is not unique by tract
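Since 965 site rows collapse to 953 unique GEOIDs, some tracts contain more than one testing site. A quick sketch to list them:

```r
# Tracts (GEOIDs) containing more than one testing site
testing.sites.acs.ready %>%
  dplyr::count(GEOID, name = "n_sites") %>%
  dplyr::filter(n_sites > 1) %>%
  dplyr::arrange(dplyr::desc(n_sites))
```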

# 34 site rows to exclude based on incomplete data and small denominators (< 500)
# Bring the aggregates and medians to a comparable scale
testing.sites.acs.complete <- 
  testing.sites.acs.ready %>% 
  mutate(commute.time.agg  = commute.time.agg/10000
         ,income.hh.median = income.hh.median/10000
         ,rent.median      = rent.median/100
         ,home.pmt.median  = home.pmt.median/100
         ) %>%
  dplyr::rename(commute.time.agg.10k  = commute.time.agg
                ,income.hh.median.10k = income.hh.median
                ,rent.median.100      = rent.median
                ,home.pmt.median.100  = home.pmt.median) %>%
  dplyr::filter(denominator >= 500 & complete.cases(.))

summary(testing.sites.acs.complete)
##    site_id             GEOID             STATEFP            COUNTYFP           age.median    commute.time.agg.10k income.hh.median.10k rent.median.100  home.pmt.median.100 race.black.any   ethn.nhisp.white.alone    married            kids        bachelor.plus        veteran       manufacturing    farm.fish.mining  commutes.car.alone     renter       single.fam.home   worked.12mo   
##  Length:931         Length:931         Length:931         Length:931         Min.   :19.60   Min.   : 0.2175      Min.   : 1.060       Min.   : 2.710   Min.   : 5.41       Min.   : 0.000   Min.   : 0.00          Min.   : 3.058   Min.   : 8.772   Min.   : 0.4713   Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 3.144     Min.   : 1.335   Min.   :  0.00   Min.   :26.86  
##  Class :character   Class :character   Class :character   Class :character   1st Qu.:32.80   1st Qu.: 2.5518      1st Qu.: 3.572       1st Qu.: 7.115   1st Qu.:10.26       1st Qu.: 1.729   1st Qu.:38.07          1st Qu.:32.269   1st Qu.:35.491   1st Qu.:13.4215   1st Qu.: 4.801   1st Qu.: 5.259   1st Qu.: 0.0000   1st Qu.:69.402     1st Qu.:24.272   1st Qu.: 51.92   1st Qu.:68.82  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :37.20   Median : 3.8400      Median : 4.924       Median : 8.650   Median :12.79       Median : 5.467   Median :67.88          Median :42.618   Median :42.688   Median :21.2064   Median : 7.232   Median : 9.166   Median : 0.7833   Median :77.762     Median :39.391   Median : 68.99   Median :75.65  
##                                                                              Mean   :37.81   Mean   : 4.6502      Mean   : 5.260       Mean   : 9.431   Mean   :14.04       Mean   :15.471   Mean   :60.27          Mean   :42.350   Mean   :42.955   Mean   :25.3015   Mean   : 7.424   Mean   :10.435   Mean   : 3.0807   Mean   :74.849     Mean   :42.076   Mean   : 64.48   Mean   :74.51  
##                                                                              3rd Qu.:42.55   3rd Qu.: 5.8768      3rd Qu.: 6.447       3rd Qu.:10.730   3rd Qu.:16.12       3rd Qu.:19.079   3rd Qu.:85.73          3rd Qu.:53.228   3rd Qu.:50.442   3rd Qu.:33.0601   3rd Qu.: 9.677   3rd Qu.:14.095   3rd Qu.: 3.0265   3rd Qu.:83.937     3rd Qu.:57.313   3rd Qu.: 81.31   3rd Qu.:81.82  
##                                                                              Max.   :70.00   Max.   :30.5225      Max.   :18.531       Max.   :30.170   Max.   :40.01       Max.   :99.358   Max.   :98.92          Max.   :73.157   Max.   :80.645   Max.   :88.7425   Max.   :31.705   Max.   :47.056   Max.   :65.6162   Max.   :95.970     Max.   :97.086   Max.   :100.00   Max.   :95.03  
##   poverty.12mo       snap.12mo       denominator   
##  Min.   : 0.6641   Min.   : 0.000   Min.   :  635  
##  1st Qu.: 9.7102   1st Qu.: 6.578   1st Qu.: 2884  
##  Median :16.5067   Median :13.905   Median : 4167  
##  Mean   :19.4242   Mean   :16.916   Mean   : 4579  
##  3rd Qu.:27.1693   3rd Qu.:24.274   3rd Qu.: 5684  
##  Max.   :73.4463   Max.   :72.193   Max.   :29636
length(unique(testing.sites.acs.complete$GEOID))
## [1] 920

I join the census tract GEOIDs to the daily PM2.5 data, and clean and summarize the PM2.5 data by tract. Tracts with fewer than 50 days of valid data are excluded.

summary(clean.2019.dailyPM2.5)
##   state_code        county_code        site_number        parameter_code          poc            latitude       longitude          datum            parameter         sample_duration    pollutant_standard  date_local        units_of_measure    event_type        observation_count observation_percent validity_indicator arithmetic_mean   first_max_value   first_max_hour       aqi        
##  Length:1448813     Length:1448813     Length:1448813     Length:1448813     Min.   : 1.000   Min.   :25.47   Min.   :-124.18   Length:1448813     Length:1448813     Length:1448813     Length:1448813     Length:1448813     Length:1448813     Length:1448813     Min.   : 1.000    Min.   :  4.00      Length:1448813     Min.   : -5.200   Min.   : -5.000   Min.   : 0.00   Min.   :  0.00  
##  Class :character   Class :character   Class :character   Class :character   1st Qu.: 1.000   1st Qu.:35.73   1st Qu.:-109.56   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.: 1.000    1st Qu.:100.00      Class :character   1st Qu.:  4.300   1st Qu.:  4.600   1st Qu.: 0.00   1st Qu.: 18.00  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 3.000   Median :39.58   Median : -90.20   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 1.000    Median :100.00      Mode  :character   Median :  6.500   Median :  7.100   Median : 0.00   Median : 27.00  
##                                                                              Mean   : 2.718   Mean   :38.90   Mean   : -94.36                                                                                                                                        Mean   : 4.656    Mean   : 99.72                         Mean   :  7.443   Mean   :  8.686   Mean   : 1.86   Mean   : 29.97  
##                                                                              3rd Qu.: 3.000   3rd Qu.:41.77   3rd Qu.: -81.24                                                                                                                                        3rd Qu.: 1.000    3rd Qu.:100.00                         3rd Qu.:  9.500   3rd Qu.: 10.900   3rd Qu.: 0.00   3rd Qu.: 40.00  
##                                                                              Max.   :33.000   Max.   :48.76   Max.   : -68.02                                                                                                                                        Max.   :24.000    Max.   :100.00                         Max.   :113.083   Max.   :896.000   Max.   :23.00   Max.   :181.00  
##                                                                                                                                                                                                                                                                                                                                                                                   NA's   :234562  
##  method_code           method          local_site_name    site_address          state              county              city           date_of_last_change  cbsa_code             cbsa             site_id         
##  Length:1448813     Length:1448813     Length:1448813     Length:1448813     Length:1448813     Length:1448813     Length:1448813     Length:1448813      Length:1448813     Length:1448813     Length:1448813    
##  Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character    Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character    Mode  :character   Mode  :character   Mode  :character  
##                                                                                                                                                                                                                   
##                                                                                                                                                                                                                   
##                                                                                                                                                                                                                   
## 
# Create a file of site IDs with GEOIDs that have complete ACS data.
site_id.GEOID.xwalk <- testing.sites.acs.complete %>%
                        dplyr::select(site_id,GEOID) %>%
                        unique()

GEOID.dailyPM2.5 <- inner_join( site_id.GEOID.xwalk
                               ,clean.2019.dailyPM2.5
                               ,by="site_id")

dim(GEOID.dailyPM2.5)
## [1] 1396734      33
# 1,396,734 rows

dim(unique(GEOID.dailyPM2.5[,c("GEOID","date_local")]))
## [1] 243809      2
# 243,809 tract-days

table(GEOID.dailyPM2.5$event_type, useNA="always")
## 
## Excluded Included     None     <NA> 
##     2200    55291  1339243        0
# Included, Excluded, None
table(GEOID.dailyPM2.5$validity_indicator, useNA="always")
## 
##       N       Y    <NA> 
##    4282 1392452       0
table(GEOID.dailyPM2.5$validity_indicator, GEOID.dailyPM2.5$event_type, useNA="always")
##       
##        Excluded Included    None    <NA>
##   N           3      221    4058       0
##   Y        2197    55070 1335185       0
##   <NA>        0        0       0       0
# 4,282 rows flagged invalid (validity_indicator == "N")
# Excluding validity_indicator == "N" does not get rid of a meaningful % of events
# GEOID.dailyPM2.5 %>% dplyr::filter(validity_indicator == "N") %>% View()
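Reading the cross-tab directly: 4,282 of 1,396,734 rows (~0.3%) are flagged invalid, and only 221 of 55,291 "Included" event rows (~0.4%) would be dropped with them. The same proportions can be computed in place (a quick sketch):

```r
# Share of all rows, and of "Included" event rows, flagged invalid
with(GEOID.dailyPM2.5,
     c( pct_invalid_all    = 100 * mean(validity_indicator == "N")
       ,pct_invalid_events = 100 * sum(validity_indicator == "N" &
                                         event_type == "Included") /
                                    sum(event_type == "Included")))
```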

summary(GEOID.dailyPM2.5$arithmetic_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -5.200   4.292   6.463   7.418   9.500 113.083
# Create preliminary outcome variables: truncate physically implausible
# negative readings at zero, then average valid measurements by tract-day
GEOID.daily.summary <-
  GEOID.dailyPM2.5 %>%
  dplyr::filter(validity_indicator=="Y") %>%
  mutate(arithmetic_mean = case_when(arithmetic_mean < 0 ~ 0, TRUE ~ arithmetic_mean)) %>%
  group_by(GEOID,date_local) %>%
  summarise(mean = mean(arithmetic_mean),.groups = "keep")

summary(GEOID.daily.summary$mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.218   6.407   7.354   9.403 113.017
# Flag "ungreen" days (daily mean PM2.5 above 12.0 ug/m3, the upper bound of
# the EPA's "Good"/green daily AQI category) and summarize by tract
GEOID.AQI.outcomes <-
  GEOID.daily.summary %>%
  mutate(ungreen = case_when(mean > 12.0 ~ 1L, TRUE ~ 0L)) %>%
  group_by(GEOID) %>%
  summarise( pm2.5_days         = n()
            ,pm2.5_p50          = quantile(mean, 0.50)
            ,pm2.5_ungreen_days = sum(ungreen)
            ,.groups="keep")

head(GEOID.AQI.outcomes)
## # A tibble: 6 x 4
## # Groups:   GEOID [6]
##   GEOID       pm2.5_days pm2.5_p50 pm2.5_ungreen_days
##   <chr>            <int>     <dbl>              <int>
## 1 01003011102        107      7.3                   8
## 2 01027959000        107      7.3                  10
## 3 01033020701         68      6.75                  4
## 4 01049960700        108      7.15                  5
## 5 01055001600        114      7.9                  15
## 6 01069040600        115      7.5                  16
summary(GEOID.AQI.outcomes)
##     GEOID             pm2.5_days      pm2.5_p50       pm2.5_ungreen_days
##  Length:920         Min.   :  6.0   Min.   : 0.2236   Min.   :  0.00    
##  Class :character   1st Qu.:119.0   1st Qu.: 5.5000   1st Qu.: 11.00    
##  Mode  :character   Median :341.0   Median : 6.7048   Median : 23.00    
##                     Mean   :261.5   Mean   : 6.6179   Mean   : 33.54    
##                     3rd Qu.:358.0   3rd Qu.: 7.7808   3rd Qu.: 49.00    
##                     Max.   :365.0   Max.   :13.6167   Max.   :227.00
# How many days of valid data do we have per tract?
table(GEOID.AQI.outcomes$pm2.5_days)
## 
##   6  27  28  34  37  39  45  46  49  50  53  54  56  57  58  59  60  61  62  64  66  68  70  75  84  85  89  90  93  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 125 127 131 142 143 146 151 152 157 159 164 168 169 171 176 189 191 195 200 203 208 209 211 213 214 226 227 234 236 239 241 244 246 248 261 262 269 270 271 272 273 274 275 277 
##   1   2   1   1   1   1   1   1   2   2   3   1   4   1   5   7   3   7   2   3   1   2   1   1   1   2   1   3   1   1   2   1   1   2   1   2   1   3   5   4   7   1   3   6  15   6  13  18  18  14  28  28  30  30   3   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   3   1   1   1   1   1   1 
## 284 286 288 290 291 292 293 299 300 303 304 305 306 307 308 309 310 311 312 313 315 316 318 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 
##   2   1   1   1   2   1   1   1   1   2   3   2   1   1   1   1   1   1   2   2   1   5   2   2   1   2   1   1   2   2   3   2   2   5   4   3   4   5   3   3   5   3   8   5   6   5   5   8   6   8  11  10  10  17  13  14  17  25  20  30  20  31  32  30  24  21  28  18  56
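The 50-day cutoff applied below can be checked against this table: the bins under 50 days (6, 27, 28, 34, 37, 39, 45, 46, and 49) hold 11 of the 920 tracts, leaving 909. As a sketch:

```r
# Count tracts falling below the 50-valid-day cutoff
sum(GEOID.AQI.outcomes$pm2.5_days < 50)
# 11 tracts excluded, 909 retained
```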
# Create final PM2.5 outcome variables for tracts with at least 50 valid days.
# Note: pm2.5_ungreen_days * 10 > pm2.5_days tests "more than 10% ungreen days"
# in integer arithmetic (and * 5 tests "more than 20%").
AQI.outcomes.final <-
  GEOID.AQI.outcomes %>%
  dplyr::filter(pm2.5_days >= 50) %>%
  mutate( pm2.5_pct_ungreen    = 100 * pm2.5_ungreen_days / pm2.5_days
         ,pm2.5_flag10_ungreen = case_when(pm2.5_ungreen_days * 10 > pm2.5_days ~ 1L, TRUE ~ 0L)
         ,pm2.5_flag20_ungreen = case_when(pm2.5_ungreen_days *  5 > pm2.5_days ~ 1L, TRUE ~ 0L)
         ) %>%
  mutate( pm2.5_flag10_ungreen = factor(pm2.5_flag10_ungreen
                                        ,levels = c(0,1)
                                        ,labels = c("0-10%",">10%"))
         ,pm2.5_flag20_ungreen = factor(pm2.5_flag20_ungreen
                                        ,levels = c(0,1)
                                        ,labels = c("0-20%",">20%"))) %>%
  dplyr::select(GEOID, pm2.5_p50, pm2.5_pct_ungreen, pm2.5_flag10_ungreen, pm2.5_flag20_ungreen)

# Review created outcome variables
head(AQI.outcomes.final)
## # A tibble: 6 x 5
## # Groups:   GEOID [6]
##   GEOID       pm2.5_p50 pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen
##   <chr>           <dbl>             <dbl> <fct>                <fct>               
## 1 01003011102      7.3               7.48 0-10%                0-20%               
## 2 01027959000      7.3               9.35 0-10%                0-20%               
## 3 01033020701      6.75              5.88 0-10%                0-20%               
## 4 01049960700      7.15              4.63 0-10%                0-20%               
## 5 01055001600      7.9              13.2  >10%                 0-20%               
## 6 01069040600      7.5              13.9  >10%                 0-20%
summary(AQI.outcomes.final$pm2.5_p50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2236  5.5000  6.7071  6.6228  7.7810 13.6167
summary(AQI.outcomes.final$pm2.5_pct_ungreen)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.882  12.045  13.120  18.103  62.192
table(AQI.outcomes.final$pm2.5_flag10_ungreen)
## 
## 0-10%  >10% 
##   372   537
table(AQI.outcomes.final$pm2.5_flag20_ungreen)
## 
## 0-20%  >20% 
##   741   168
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag10_ungreen==">10%") %>% summary()
##     GEOID             pm2.5_p50      pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen
##  Length:537         Min.   : 3.710   Min.   :10.06     0-10%:  0            0-20%:369           
##  Class :character   1st Qu.: 6.707   1st Qu.:13.22     >10% :537            >20% :168           
##  Mode  :character   Median : 7.562   Median :16.67                                              
##                     Mean   : 7.620   Mean   :18.82                                              
##                     3rd Qu.: 8.434   3rd Qu.:21.78                                              
##                     Max.   :13.617   Max.   :62.19
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag20_ungreen==">20%") %>% summary()
##     GEOID             pm2.5_p50      pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen
##  Length:168         Min.   : 4.800   Min.   :20.11     0-10%:  0            0-20%:  0           
##  Class :character   1st Qu.: 8.280   1st Qu.:22.11     >10% :168            >20% :168           
##  Mode  :character   Median : 8.801   Median :25.79                                              
##                     Mean   : 8.893   Mean   :27.81                                              
##                     3rd Qu.: 9.500   3rd Qu.:30.73                                              
##                     Max.   :13.617   Max.   :62.19

With both predictors and outcomes summarized by census tract, the ACS and AQS files can now be joined to create the final data set.

str(AQI.outcomes.final)
## tibble [909 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ GEOID               : chr [1:909] "01003011102" "01027959000" "01033020701" "01049960700" ...
##  $ pm2.5_p50           : Named num [1:909] 7.3 7.3 6.75 7.15 7.9 ...
##   ..- attr(*, "names")= chr [1:909] "50%" "50%" "50%" "50%" ...
##  $ pm2.5_pct_ungreen   : num [1:909] 7.48 9.35 5.88 4.63 13.16 ...
##  $ pm2.5_flag10_ungreen: Factor w/ 2 levels "0-10%",">10%": 1 1 1 1 2 2 2 2 2 2 ...
##  $ pm2.5_flag20_ungreen: Factor w/ 2 levels "0-20%",">20%": 1 1 1 1 1 1 2 2 1 1 ...
##  - attr(*, "groups")= tibble [909 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ GEOID: chr [1:909] "01003011102" "01027959000" "01033020701" "01049960700" ...
##   ..$ .rows: list<int> [1:909] 
##   .. ..$ : int 1
##   .. ..$ : int 2
##   .. ..$ : int 3
##   .. .. [list output truncated]
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
str(testing.sites.acs.complete)
## 'data.frame':    931 obs. of  24 variables:
##  $ site_id               : chr  "010550010" "010970003" "010030010" "011011002" ...
##  $ GEOID                 : chr  "01055001600" "01097005300" "01003011102" "01101002500" ...
##  $ STATEFP               : chr  "01" "01" "01" "01" ...
##  $ COUNTYFP              : chr  "055" "097" "003" "101" ...
##  $ age.median            : num  37.1 34.2 43 35.3 47.1 39.5 40.4 33.4 33.7 38.3 ...
##  $ commute.time.agg.10k  : num  2.39 1.3 4.66 1.61 2.17 ...
##  $ income.hh.median.10k  : num  3.62 3.28 6.49 3.41 2.69 ...
##  $ rent.median.100       : num  8.41 5.98 9.37 7.8 3.94 6.04 8.11 7.87 6.85 9.49 ...
##  $ home.pmt.median.100   : num  8.4 10.52 13.81 8.56 9.71 ...
##  $ race.black.any        : num  68.61 37.87 2.17 66.38 17.01 ...
##  $ ethn.nhisp.white.alone: num  30.6 57.3 89.1 21.6 72 ...
##  $ married               : num  28.2 33.5 60.9 33.3 42 ...
##  $ kids                  : num  30.6 49.2 44.2 48.3 31.7 ...
##  $ bachelor.plus         : num  18 16.1 39.1 19.5 13.3 ...
##  $ veteran               : num  10.02 9.1 10.67 6.56 7.41 ...
##  $ manufacturing         : num  34.38 13.77 9.52 16.85 31.61 ...
##  $ farm.fish.mining      : num  1.77 0 0 0 0 ...
##  $ commutes.car.alone    : num  86.5 88.3 83.8 92.7 90.9 ...
##  $ renter                : num  28.71 45.43 7.48 54.51 30.97 ...
##  $ single.fam.home       : num  93.7 75.1 83.7 90.1 67.4 ...
##  $ worked.12mo           : num  63.8 61.2 75.6 66.7 64.2 ...
##  $ poverty.12mo          : num  28.68 27.46 7.23 24.61 37.42 ...
##  $ snap.12mo             : num  12.66 22.88 6.37 34.27 17.42 ...
##  $ denominator           : num  3300 1970 4385 2317 2975 ...
# Drop unnecessary geographic variables from the ACS file
acs.joinready <- 
  testing.sites.acs.complete %>%
  dplyr::select(-site_id, -STATEFP, -COUNTYFP) %>%
  unique()

dim(acs.joinready)
## [1] 920  21
length(unique(acs.joinready$GEOID))
## [1] 920
# 920 rows, unique on GEOID

# Join PM2.5 and ACS data by tract on GEOID
AQI.ACS <- inner_join(AQI.outcomes.final,acs.joinready,by="GEOID")
head(AQI.ACS)
## # A tibble: 6 x 25
## # Groups:   GEOID [6]
##   GEOID       pm2.5_p50 pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen age.median commute.time.agg.10k income.hh.median.10k rent.median.100 home.pmt.median.100 race.black.any ethn.nhisp.white.alone married  kids bachelor.plus veteran manufacturing farm.fish.mining commutes.car.alone renter single.fam.home worked.12mo poverty.12mo snap.12mo denominator
##   <chr>           <dbl>             <dbl> <fct>                <fct>                     <dbl>                <dbl>                <dbl>           <dbl>               <dbl>          <dbl>                  <dbl>   <dbl> <dbl>         <dbl>   <dbl>         <dbl>            <dbl>              <dbl>  <dbl>           <dbl>       <dbl>        <dbl>     <dbl>       <dbl>
## 1 01003011102      7.3               7.48 0-10%                0-20%                      43                   4.66                 6.49            9.37               13.8            2.17                   89.1    60.9  44.2         39.1    10.7           9.52            0                   83.8   7.48            83.7        75.6         7.23      6.37        4385
## 2 01027959000      7.3               9.35 0-10%                0-20%                      47.1                 2.17                 2.69            3.94                9.71          17.0                    72.0    42.0  31.7         13.3     7.41         31.6             0                   90.9  31.0             67.4        64.2        37.4      17.4         2975
## 3 01033020701      6.75              5.88 0-10%                0-20%                      44.1                 5.96                 6.34            7.61               11.9           11.1                    82.8    55.0  33.1         24.1    10.4          23.1             0                   95.5  13.6             76.2        73.2         8.67      4.50        6224
## 4 01049960700      7.15              4.63 0-10%                0-20%                      33.7                10.3                  3.49            6.85                9.56           1.01                   59.9    56.3  46.4          8.55    3.65         36.2             2.47                86.9  30.6             53.8        62.5        28.8      17.5         9067
## 5 01055001600      7.9              13.2  >10%                 0-20%                      37.1                 2.39                 3.62            8.41                8.4           68.6                    30.6    28.2  30.6         18.0    10.0          34.4             1.77                86.5  28.7             93.7        63.8        28.7      12.7         3300
## 6 01069040600      7.5              13.9  >10%                 0-20%                      33.8                 1.04                 1.46            5.62                7.13          85.6                    12.7    12.8  52.5          4.73    5.78         12.8             0.753               74.0  76.5             55.0        54.1        57.2      45.2         2021
# Prepare the geography/geometry file for linkage
str(testing.sites.tracts)
## Classes 'sf' and 'data.frame':   965 obs. of  23 variables:
##  $ STATEFP        : chr  "01" "01" "01" "01" ...
##  $ COUNTYFP       : chr  "055" "097" "003" "101" ...
##  $ TRACTCE        : chr  "001600" "005300" "011102" "002500" ...
##  $ GEOID          : chr  "01055001600" "01097005300" "01003011102" "01101002500" ...
##  $ NAME           : chr  "16" "53" "111.02" "25" ...
##  $ NAMELSAD       : chr  "Census Tract 16" "Census Tract 53" "Census Tract 111.02" "Census Tract 25" ...
##  $ MTFCC          : chr  "G5020" "G5020" "G5020" "G5020" ...
##  $ FUNCSTAT       : chr  "S" "S" "S" "S" ...
##  $ ALAND          : num  1.05e+07 4.99e+06 4.42e+07 4.49e+06 1.80e+08 ...
##  $ AWATER         : num  1540235 294542 813845 37253 1129596 ...
##  $ INTPTLAT       : chr  "+33.9780622" "+30.7789148" "+30.4799296" "+32.4177882" ...
##  $ INTPTLON       : chr  "-085.9775275" "-088.0797620" "-087.8330813" "-086.2722538" ...
##  $ state_code     : chr  "01" "01" "01" "01" ...
##  $ state          : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ county_code    : chr  "055" "097" "003" "101" ...
##  $ county         : chr  "Etowah" "Mobile" "Baldwin" "Montgomery" ...
##  $ city           : chr  "Gadsden" "Chickasaw" "Fairhope" "Montgomery" ...
##  $ cbsa_code      : chr  "23460" "33660" "19300" "33860" ...
##  $ cbsa           : chr  "Gadsden, AL" "Mobile, AL" "Daphne-Fairhope-Foley, AL" "Montgomery, AL" ...
##  $ site_number    : chr  "0010" "0003" "0010" "1002" ...
##  $ local_site_name: chr  "GADSDEN C. COLLEGE" "CHICKASAW" "FAIRHOPE, Alabama" "MOMS, ADEM" ...
##  $ site_address   : chr  "1001 WALLACE DRIVE, GADSDEN, AL 35902" "Iroquois and Azalea, CHICKASAW, MOBILE CO.,  ALABAMA" "FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE,  ALABAMA" "1350 COLISEUM BLVD, MONTGOMERY, ALABAMA" ...
##  $ geometry       :sfc_GEOMETRY of length 965; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:424, 1:2] -86 -86 -86 -86 -86 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:22] "STATEFP" "COUNTYFP" "TRACTCE" "GEOID" ...
geom.joinready <-
  testing.sites.tracts %>%
  dplyr::select(GEOID,NAMELSAD,STATEFP,state,COUNTYFP,county,city,cbsa,cbsa_code,geometry) %>%
  unique()
dim(geom.joinready)
## [1] 954  10
length(unique(geom.joinready$GEOID))
## [1] 953
geom.joinready %>% group_by(GEOID) %>% summarise(obs=n()) %>% arrange(-obs) %>% head()
## `summarise()` ungrouping output (override with `.groups` argument)
## Simple feature collection with 6 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -120.6506 ymin: 30.43231 xmax: -85.77036 ymax: 35.11134
## geographic CRS: NAD83
## # A tibble: 6 x 3
##   GEOID         obs                                                                                geometry
##   <chr>       <int>                                                                      <MULTIPOLYGON [°]>
## 1 06079012304     2 (((-120.6504 34.97578, -120.6452 34.9935, -120.6436 34.99941, -120.6433 34.99998, -1...
## 2 01003011102     1 (((-87.88641 30.47292, -87.88641 30.47597, -87.8864 30.47612, -87.8864 30.47784, -87...
## 3 01027959000     1 (((-86.01156 33.29432, -86.00613 33.2943, -86.00355 33.29429, -85.99997 33.29427, -8...
## 4 01033020701     1 (((-87.67893 34.70889, -87.67891 34.71478, -87.67891 34.71493, -87.67794 34.7148, -8...
## 5 01049960700     1 (((-86.10822 34.29816, -86.10801 34.29831, -86.10724 34.29879, -86.10644 34.29897, -...
## 6 01055001600     1 (((-85.99934 33.98564, -85.99933 33.98592, -85.99927 33.9864, -85.99918 33.98703, -8...
geom.joinready %>% dplyr::filter(GEOID=="06079012304")
## Simple feature collection with 2 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -120.6506 ymin: 34.95956 xmax: -120.381 ymax: 35.11134
## geographic CRS: NAD83
##              GEOID            NAMELSAD STATEFP      state COUNTYFP          county          city                                          cbsa cbsa_code                       geometry
## 7797   06079012304 Census Tract 123.04      06 California      079 San Luis Obispo        Nipomo San Luis Obispo-Paso Robles-Arroyo Grande, CA     42020 MULTIPOLYGON (((-120.6504 3...
## 7797.1 06079012304 Census Tract 123.04      06 California      079 San Luis Obispo Arroyo Grande San Luis Obispo-Paso Robles-Arroyo Grande, CA     42020 MULTIPOLYGON (((-120.6504 3...
# Hard-code one city value (tract 06079012304 is listed under two cities) so the file is unique by GEOID
geom.joinready2 <- 
  geom.joinready %>%
  mutate(city = case_when(GEOID=="06079012304" ~ "Arroyo Grande", TRUE ~ city)) %>%
  unique()

dim(geom.joinready2)
## [1] 953  10
length(unique(geom.joinready2$GEOID))
## [1] 953
head(geom.joinready2)
## Simple feature collection with 6 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.10077 ymin: 30.43231 xmax: -84.96303 ymax: 34.00093
## geographic CRS: NAD83
##         GEOID            NAMELSAD STATEFP   state COUNTYFP     county        city                      cbsa cbsa_code                       geometry
## 1 01055001600     Census Tract 16      01 Alabama      055     Etowah     Gadsden               Gadsden, AL     23460 MULTIPOLYGON (((-85.99934 3...
## 2 01097005300     Census Tract 53      01 Alabama      097     Mobile   Chickasaw                Mobile, AL     33660 MULTIPOLYGON (((-88.10077 3...
## 3 01003011102 Census Tract 111.02      01 Alabama      003    Baldwin    Fairhope Daphne-Fairhope-Foley, AL     19300 MULTIPOLYGON (((-87.88641 3...
## 4 01101002500     Census Tract 25      01 Alabama      101 Montgomery  Montgomery            Montgomery, AL     33860 MULTIPOLYGON (((-86.28735 3...
## 5 01027959000   Census Tract 9590      01 Alabama      027       Clay     Ashland                      <NA>      <NA> MULTIPOLYGON (((-86.01156 3...
## 6 01113030800    Census Tract 308      01 Alabama      113    Russell Phenix City           Columbus, GA-AL     17980 MULTIPOLYGON (((-85.01535 3...
# Join geography/geometry file with AQS outcomes and ACS variables
AQI.ACS.geom <- inner_join(geom.joinready2, AQI.ACS, by = "GEOID") %>% ungroup()
head(AQI.ACS.geom)
## Simple feature collection with 6 features and 33 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.10077 ymin: 30.43231 xmax: -84.96303 ymax: 34.00093
## geographic CRS: NAD83
## # A tibble: 6 x 34
##   GEOID NAMELSAD STATEFP state COUNTYFP county city  cbsa  cbsa_code pm2.5_p50 pm2.5_pct_ungre… pm2.5_flag10_un… pm2.5_flag20_un… age.median commute.time.ag… income.hh.media… rent.median.100 home.pmt.median… race.black.any ethn.nhisp.whit… married  kids bachelor.plus veteran manufacturing farm.fish.mining commutes.car.al… renter single.fam.home worked.12mo poverty.12mo snap.12mo denominator
##   <chr> <chr>    <chr>   <chr> <chr>    <chr>  <chr> <chr> <chr>         <dbl>            <dbl> <fct>            <fct>                 <dbl>            <dbl>            <dbl>           <dbl>            <dbl>          <dbl>            <dbl>   <dbl> <dbl>         <dbl>   <dbl>         <dbl>            <dbl>            <dbl>  <dbl>           <dbl>       <dbl>        <dbl>     <dbl>       <dbl>
## 1 0105… Census … 01      Alab… 055      Etowah Gads… Gads… 23460          7.9             13.2  >10%             0-20%                  37.1             2.39             3.62            8.41             8.4           68.6             30.6     28.2  30.6         18.0    10.0          34.4              1.77             86.5  28.7             93.7        63.8        28.7      12.7         3300
## 2 0109… Census … 01      Alab… 097      Mobile Chic… Mobi… 33660          8                9.82 0-10%            0-20%                  34.2             1.30             3.28            5.98            10.5           37.9             57.3     33.5  49.2         16.1     9.10         13.8              0                88.3  45.4             75.1        61.2        27.5      22.9         1970
## 3 0100… Census … 01      Alab… 003      Baldw… Fair… Daph… 19300          7.3              7.48 0-10%            0-20%                  43               4.66             6.49            9.37            13.8            2.17            89.1     60.9  44.2         39.1    10.7           9.52             0                83.8   7.48            83.7        75.6         7.23      6.37        4385
## 4 0110… Census … 01      Alab… 101      Montg… Mont… Mont… 33860          8.45            19.3  >10%             0-20%                  35.3             1.61             3.41            7.8              8.56          66.4             21.6     33.3  48.3         19.5     6.56         16.8              0                92.7  54.5             90.1        66.7        24.6      34.3         2317
## 5 0102… Census … 01      Alab… 027      Clay   Ashl… <NA>  <NA>           7.3              9.35 0-10%            0-20%                  47.1             2.17             2.69            3.94             9.71          17.0             72.0     42.0  31.7         13.3     7.41         31.6              0                90.9  31.0             67.4        64.2        37.4      17.4         2975
## 6 0111… Census … 01      Alab… 113      Russe… Phen… Colu… 17980          8.61            25.6  >10%             >20%                   39.5             2.59             2.10            6.04             9.44          92.9              6.52    21.2  25.7          4.31    9.70         17.0              0                85.3  46.4             71.7        58.0        45.1      28.7         3254
## # … with 1 more variable: geometry <MULTIPOLYGON [°]>
class(AQI.ACS.geom)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

Results

Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.

Analysis plan: perform bivariate tests of community characteristics against air quality (both continuous measures and a binary indicator for poor air quality above a threshold); fit multivariable models predicting air quality from community characteristics; and construct classifiers based on community characteristics, evaluating their performance against the observed data.
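As a concrete illustration of the bivariate step, one such test could look like the following (a minimal sketch using base R; the column names match the ACS-derived variables used throughout this section):

```r
# Does median household income differ between tracts that did and did not
# reach 12 mcg/m3 of PM2.5 on more than 10% of days? (Welch two-sample t-test)
t.test(income.hh.median.10k ~ pm2.5_flag10_ungreen, data = AQI.ACS)

# Correlation of the same predictor with the continuous outcome
cor.test(AQI.ACS$income.hh.median.10k, AQI.ACS$pm2.5_p50)
```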

Continuous PM2.5 outcomes by tract, mapped:

AQI.ACS.geom.WGS84 <- st_transform(AQI.ACS.geom, 4326) # FROM 4269 TO 4326 to cooperate with leaflet

palette <- colorNumeric("viridis", NULL, reverse=TRUE)

# Median PM2.5 by tract - interactive
popup_msg <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
                   ,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
                   ,"<br>Median PM2.5: ",round(AQI.ACS.geom.WGS84$pm2.5_p50,digits=2))

leaflet(AQI.ACS.geom.WGS84) %>%
  addPolygons(fillColor = ~palette(pm2.5_p50)
              ,fillOpacity = 0.9
              ,weight = 1.5
              ,color = "black"
              ,popup = popup_msg) %>%      
  addTiles() %>%
  addLegend("bottomright",         
            pal=palette,      
            values=~pm2.5_p50,   
            title = 'PM2.5',  
            opacity = 1) %>%    
  addScaleBar()
# % of days above 12 mcg/m3 by tract - interactive
popup_msg2 <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
                   ,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
                   ,"<br>Reached 12mcg/m3 on ",round(AQI.ACS.geom.WGS84$pm2.5_pct_ungreen,digits=1),"% of days")

leaflet(AQI.ACS.geom.WGS84) %>%
  addPolygons(fillColor = ~palette(pm2.5_pct_ungreen)
              ,fillOpacity = 0.9
              ,weight = 1.5
              ,color = "black"
              ,popup = popup_msg2) %>%      
  addTiles() %>%
  addLegend("bottomright",         
            pal=palette,      
            values=~pm2.5_pct_ungreen,   
            title = '% days of 12+ mcg/m3 PM2.5',  
            opacity = 1) %>%    
  addScaleBar()
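For a static counterpart to these interactive maps (e.g. for a PDF export), ggplot2's geom_sf can draw the same choropleth directly from the sf object; a sketch:

```r
# Static version of the median-PM2.5 choropleth, using the same reversed
# viridis scale as the leaflet palette above
ggplot(AQI.ACS.geom.WGS84) +
  geom_sf(aes(fill = pm2.5_p50), color = "black", size = 0.1) +
  scale_fill_viridis_c(direction = -1, name = "PM2.5") +
  theme_minimal()
```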

Regression results, using all 19 ACS-based predictor variables:

# Run linear regression for continuous outcomes
lm.p50 <- lm(pm2.5_p50 ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom)

lm.pct.ungreen <- lm(pm2.5_pct_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom)

# Run logistic regression for binary outcomes
logit.flag10.ungreen <- glm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,family = binomial())

logit.flag20.ungreen <- glm(pm2.5_flag20_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,family = binomial())

# Extract p-values by predictor variable from regression results into a summary data frame.
lm.p50.pval.df <- data.frame(acs.var = names(summary(lm.p50)$coefficients[-1,4])
                             ,lm.p50.pval  = summary(lm.p50)$coefficients[-1,4])

lm.pct.ungreen.pval.df <- data.frame(acs.var = names(summary(lm.pct.ungreen)$coefficients[-1,4])
                                    ,lm.pct.ungreen.pval   = summary(lm.pct.ungreen)$coefficients[-1,4])

logit.flag10.ungreen.pval.df <- data.frame(acs.var = names(summary(logit.flag10.ungreen)$coefficients[-1,4])
                                    ,logit.flag10.ungreen.pval = summary(logit.flag10.ungreen)$coefficients[-1,4])

logit.flag20.ungreen.pval.df <- data.frame(acs.var = names(summary(logit.flag20.ungreen)$coefficients[-1,4])
                                    ,logit.flag20.ungreen.pval = summary(logit.flag20.ungreen)$coefficients[-1,4])

# Join p-value data frames together by ACS-based variable name. 
# Sort ascending by p-value for median PM2.5 regression
pval1 <- inner_join(lm.p50.pval.df,lm.pct.ungreen.pval.df, by="acs.var")
pval2 <- inner_join(pval1,logit.flag10.ungreen.pval.df, by="acs.var")
pval3 <- inner_join(pval2,logit.flag20.ungreen.pval.df, by="acs.var") %>% arrange(lm.p50.pval)
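Since 19 predictors are tested against each of four outcomes, it may be worth reporting multiplicity-adjusted p-values alongside the raw ones; a sketch using Benjamini-Hochberg adjustment for the median-PM2.5 model (pval3 as constructed above):

```r
# Add FDR-adjusted (Benjamini-Hochberg) p-values for the median-PM2.5 regression
pval3.adj <- pval3 %>%
  mutate(lm.p50.pval.bh = p.adjust(lm.p50.pval, method = "BH")) %>%
  arrange(lm.p50.pval.bh)
```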

# Return results
summary(lm.p50)
## 
## Call:
## lm(formula = pm2.5_p50 ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, data = AQI.ACS.geom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3401 -1.0163 -0.0799  1.0319  6.7636 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            10.281525   1.188349   8.652  < 2e-16 ***
## age.median             -0.030768   0.012186  -2.525 0.011746 *  
## commute.time.agg.10k    0.011560   0.020144   0.574 0.566215    
## income.hh.median.10k   -0.020663   0.061624  -0.335 0.737473    
## rent.median.100         0.013211   0.027278   0.484 0.628282    
## home.pmt.median.100    -0.015226   0.018848  -0.808 0.419410    
## race.black.any          0.015403   0.003726   4.134 3.90e-05 ***
## ethn.nhisp.white.alone -0.007592   0.003167  -2.398 0.016705 *  
## married                 0.004083   0.008184   0.499 0.617959    
## kids                   -0.008251   0.006990  -1.180 0.238167    
## bachelor.plus           0.005070   0.006317   0.803 0.422426    
## veteran                -0.046590   0.018786  -2.480 0.013321 *  
## manufacturing           0.031802   0.008779   3.623 0.000308 ***
## farm.fish.mining       -0.046542   0.010051  -4.630 4.19e-06 ***
## commutes.car.alone      0.019526   0.005590   3.493 0.000501 ***
## renter                  0.004082   0.005638   0.724 0.469257    
## single.fam.home        -0.002814   0.003933  -0.716 0.474407    
## worked.12mo            -0.041731   0.008345  -5.000 6.89e-07 ***
## poverty.12mo           -0.011233   0.009332  -1.204 0.229001    
## snap.12mo              -0.002669   0.007966  -0.335 0.737633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.592 on 889 degrees of freedom
## Multiple R-squared:  0.2524, Adjusted R-squared:  0.2364 
## F-statistic:  15.8 on 19 and 889 DF,  p-value: < 2.2e-16
summary(lm.pct.ungreen)
## 
## Call:
## lm(formula = pm2.5_pct_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, data = AQI.ACS.geom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.932  -5.907  -1.368   4.229  47.795 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            27.820524   6.444245   4.317 1.76e-05 ***
## age.median             -0.096365   0.066081  -1.458 0.145117    
## commute.time.agg.10k    0.016552   0.109237   0.152 0.879597    
## income.hh.median.10k   -0.211577   0.334180  -0.633 0.526815    
## rent.median.100        -0.005321   0.147927  -0.036 0.971314    
## home.pmt.median.100    -0.061119   0.102213  -0.598 0.550021    
## race.black.any          0.022965   0.020205   1.137 0.256017    
## ethn.nhisp.white.alone -0.050822   0.017172  -2.960 0.003162 ** 
## married                 0.013969   0.044380   0.315 0.753017    
## kids                   -0.008858   0.037905  -0.234 0.815273    
## bachelor.plus           0.021526   0.034253   0.628 0.529887    
## veteran                -0.136190   0.101874  -1.337 0.181615    
## manufacturing           0.172555   0.047607   3.625 0.000306 ***
## farm.fish.mining       -0.134663   0.054507  -2.471 0.013677 *  
## commutes.car.alone      0.025827   0.030314   0.852 0.394454    
## renter                  0.029989   0.030575   0.981 0.326949    
## single.fam.home         0.022962   0.021327   1.077 0.281930    
## worked.12mo            -0.152661   0.045256  -3.373 0.000775 ***
## poverty.12mo           -0.061633   0.050605  -1.218 0.223584    
## snap.12mo               0.021525   0.043200   0.498 0.618417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.632 on 889 degrees of freedom
## Multiple R-squared:  0.1533, Adjusted R-squared:  0.1353 
## F-statistic: 8.475 on 19 and 889 DF,  p-value: < 2.2e-16
summary(logit.flag10.ungreen)
## 
## Call:
## glm(formula = pm2.5_flag10_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, family = binomial(), data = AQI.ACS.geom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4284  -1.0623   0.5540   0.9632   1.7815  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             5.0426702  1.7523543   2.878  0.00401 ** 
## age.median             -0.0306630  0.0171606  -1.787  0.07397 .  
## commute.time.agg.10k   -0.0077921  0.0271376  -0.287  0.77401    
## income.hh.median.10k    0.0237597  0.0839018   0.283  0.77704    
## rent.median.100        -0.0340125  0.0371454  -0.916  0.35984    
## home.pmt.median.100    -0.0275504  0.0263048  -1.047  0.29494    
## race.black.any          0.0151165  0.0059452   2.543  0.01100 *  
## ethn.nhisp.white.alone -0.0108105  0.0044240  -2.444  0.01454 *  
## married                 0.0052513  0.0115284   0.456  0.64874    
## kids                   -0.0061786  0.0099806  -0.619  0.53588    
## bachelor.plus           0.0118970  0.0087472   1.360  0.17380    
## veteran                -0.0267337  0.0259119  -1.032  0.30221    
## manufacturing           0.0727012  0.0138298   5.257 1.47e-07 ***
## farm.fish.mining       -0.0343879  0.0146051  -2.355  0.01855 *  
## commutes.car.alone     -0.0048214  0.0079629  -0.605  0.54486    
## renter                  0.0069986  0.0079342   0.882  0.37773    
## single.fam.home        -0.0008233  0.0056277  -0.146  0.88369    
## worked.12mo            -0.0349466  0.0123114  -2.839  0.00453 ** 
## poverty.12mo           -0.0245880  0.0134790  -1.824  0.06813 .  
## snap.12mo               0.0059546  0.0114358   0.521  0.60258    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1230.0  on 908  degrees of freedom
## Residual deviance: 1078.6  on 889  degrees of freedom
## AIC: 1118.6
## 
## Number of Fisher Scoring iterations: 4
summary(logit.flag20.ungreen)
## 
## Call:
## glm(formula = pm2.5_flag20_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, family = binomial(), data = AQI.ACS.geom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3347  -0.6637  -0.5104  -0.3763   2.4371  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             7.962e-01  1.923e+00   0.414  0.67891   
## age.median             -2.691e-03  2.082e-02  -0.129  0.89717   
## commute.time.agg.10k   -1.050e-02  3.917e-02  -0.268  0.78865   
## income.hh.median.10k   -2.337e-01  1.215e-01  -1.923  0.05451 . 
## rent.median.100         4.126e-02  5.053e-02   0.816  0.41426   
## home.pmt.median.100     1.835e-05  3.366e-02   0.001  0.99957   
## race.black.any          1.417e-03  5.585e-03   0.254  0.79970   
## ethn.nhisp.white.alone -9.910e-03  5.076e-03  -1.952  0.05092 . 
## married                 1.540e-03  1.376e-02   0.112  0.91089   
## kids                    7.767e-03  1.155e-02   0.673  0.50125   
## bachelor.plus           1.035e-02  1.109e-02   0.933  0.35085   
## veteran                -1.466e-02  3.223e-02  -0.455  0.64912   
## manufacturing           9.666e-03  1.415e-02   0.683  0.49446   
## farm.fish.mining       -1.214e-02  1.786e-02  -0.679  0.49683   
## commutes.car.alone      1.179e-02  9.554e-03   1.234  0.21711   
## renter                  9.602e-04  9.174e-03   0.105  0.91664   
## single.fam.home         8.781e-03  6.445e-03   1.362  0.17310   
## worked.12mo            -3.752e-02  1.316e-02  -2.850  0.00437 **
## poverty.12mo           -5.823e-03  1.521e-02  -0.383  0.70187   
## snap.12mo              -3.912e-03  1.249e-02  -0.313  0.75423   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 870.13  on 908  degrees of freedom
## Residual deviance: 802.87  on 889  degrees of freedom
## AIC: 842.87
## 
## Number of Fisher Scoring iterations: 5
pval3
##                   acs.var  lm.p50.pval lm.pct.ungreen.pval logit.flag10.ungreen.pval logit.flag20.ungreen.pval
## 1             worked.12mo 6.892815e-07        0.0007749118              4.531875e-03               0.004367877
## 2        farm.fish.mining 4.193095e-06        0.0136765637              1.854684e-02               0.496829902
## 3          race.black.any 3.901449e-05        0.2560166477              1.100201e-02               0.799699586
## 4           manufacturing 3.082985e-04        0.0003058825              1.465384e-07               0.494461052
## 5      commutes.car.alone 5.012410e-04        0.3944535502              5.448595e-01               0.217107780
## 6              age.median 1.174578e-02        0.1451167688              7.396545e-02               0.897173725
## 7                 veteran 1.332100e-02        0.1816154439              3.022059e-01               0.649115596
## 8  ethn.nhisp.white.alone 1.670511e-02        0.0031615614              1.454114e-02               0.050915996
## 9            poverty.12mo 2.290010e-01        0.2235836814              6.812551e-02               0.701870729
## 10                   kids 2.381670e-01        0.8152731603              5.358795e-01               0.501254312
## 11    home.pmt.median.100 4.194097e-01        0.5500206039              2.949380e-01               0.999565046
## 12          bachelor.plus 4.224260e-01        0.5298874824              1.738009e-01               0.350849549
## 13                 renter 4.692571e-01        0.3269485271              3.777317e-01               0.916636934
## 14        single.fam.home 4.744072e-01        0.2819304914              8.836892e-01               0.173100905
## 15   commute.time.agg.10k 5.662155e-01        0.8795974016              7.740095e-01               0.788645632
## 16                married 6.179593e-01        0.7530172352              6.487405e-01               0.910894353
## 17        rent.median.100 6.282817e-01        0.9713137300              3.598450e-01               0.414260334
## 18   income.hh.median.10k 7.374727e-01        0.5268151364              7.770352e-01               0.054510257
## 19              snap.12mo 7.376328e-01        0.6184171875              6.025790e-01               0.754228687

Bivariate plots for the predictor variables most strongly associated with the median PM2.5 and 10%-of-days outcomes:

# worked.12mo
ggplot(data = AQI.ACS.geom, aes(x = worked.12mo, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% Worked in Past 12 Months") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = worked.12mo)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% Worked in Past 12 Months") 

# manufacturing
ggplot(data = AQI.ACS.geom, aes(x = manufacturing, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Workers in Manufacturing Industry") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = manufacturing)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Workers in Manufacturing Industry") 

# farm.fish.mining
ggplot(data = AQI.ACS.geom, aes(x = farm.fish.mining, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Workers in Agriculture, Forestry,\nFishing & Hunting, and Mining Industry Category") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = farm.fish.mining)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Workers in Agriculture, Forestry,\nFishing & Hunting, and Mining Industry Category")
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm): collapsing to unique 'x' values

# ethn.nhisp.white.alone
ggplot(data = AQI.ACS.geom, aes(x = ethn.nhisp.white.alone, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Respondents Reporting\nNon-Hispanic Ethnicity and White Race Only") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = ethn.nhisp.white.alone)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Respondents Reporting\nNon-Hispanic Ethnicity and White Race Only")

# race.black.any
ggplot(data = AQI.ACS.geom, aes(x = race.black.any, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Respondents Reporting Black Race\n(with or without other categories)") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = race.black.any)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Respondents Reporting Black Race\n(with or without other categories)")
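The five scatterplots above could also be produced as a single faceted figure, which makes the slopes easier to compare at a glance (a sketch; assumes tidyr is available and drops the geometry column first):

```r
library(tidyr)

# All five highlighted predictors vs median PM2.5 in one faceted figure
AQI.ACS.geom %>%
  sf::st_drop_geometry() %>%
  pivot_longer(cols = c(worked.12mo, manufacturing, farm.fish.mining,
                        ethn.nhisp.white.alone, race.black.any),
               names_to = "acs.var", values_to = "value") %>%
  ggplot(aes(x = value, y = pm2.5_p50)) +
  geom_point(size = 0.5) +
  geom_smooth(color = "red", method = "lm") +
  facet_wrap(~ acs.var, scales = "free_x") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
```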

dim(AQI.ACS)
## [1] 909  25
set.seed(1504492)
# Assign each tract to one of 5 cross-validation folds, then drop the outcome
# variables not needed for this classification task
kgroup <- data.frame(kgroup = sample(1:5, size = nrow(AQI.ACS), replace = TRUE))
AQI.ACS.onlynec <- bind_cols(AQI.ACS, kgroup) %>% 
                    select(-pm2.5_p50, -pm2.5_pct_ungreen, -pm2.5_flag20_ungreen, -denominator) %>%
                    ungroup()

for (i in 1:5) {

    train <- filter(AQI.ACS.onlynec, kgroup != i)
    test  <- filter(AQI.ACS.onlynec, kgroup == i)
    
  # GLM: All variables
    glm <- train %>% glm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , family = binomial())
    
    glm.predict <- bind_cols(test[,"GEOID"], data.frame(glmpred = predict(glm, test, type = "response")))
    
  # GLM: Selected variables
    select.glm <- train %>% glm(pm2.5_flag10_ungreen ~ 
               age.median + race.black.any + ethn.nhisp.white.alone + veteran 
               + manufacturing + farm.fish.mining + worked.12mo + commutes.car.alone
             , data = .
             , family = binomial())
    
    select.predict <- bind_cols(test[,"GEOID"], data.frame(selectpred = predict(select.glm, test, type = "response")))
    
    # Random forest
    rf <- train %>% randomForest(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , ntree = 2000)
    
    rf.predict <- bind_cols(test[,"GEOID"], data.frame(rfpred = predict(rf, test, type = "prob")[,2]))
    
    # Support vector machines with radial kernel
    svm <- train %>% svm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , scale = TRUE
             , kernel = "radial"
             , probability = TRUE)
    
    svm.step1 <- predict(svm, test, probability = TRUE)

    svm.predict <- bind_cols(test[,"GEOID"], data.frame(svmpred = attr(svm.step1, "probabilities")[,2]))

    # Join the predictions together by GEOID
    two.predict   <- inner_join(glm.predict, rf.predict, by="GEOID")
    three.predict <- inner_join(two.predict, select.predict, by="GEOID")
    four.predict  <- inner_join(three.predict, svm.predict, by="GEOID")
        
    if (i==1){
      collect.predict <- four.predict
    } else {
      collect.predict <- bind_rows(collect.predict, four.predict)
    }
}

AQI.ACS.preds <- inner_join(AQI.ACS.onlynec, collect.predict, by = "GEOID")
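AUC summarizes ranking performance; to see the actual errors at a decision threshold, the cross-validated probabilities can also be dichotomized and cross-tabulated against the observed flags (a sketch; the 0.5 cutoff is arbitrary, not a recommendation):

```r
# Confusion matrix for the all-variables logistic model at a 0.5 cutoff
AQI.ACS.preds %>%
  mutate(glm.class = ifelse(glmpred >= 0.5, ">10%", "0-10%")) %>%
  with(table(predicted = glm.class, observed = pm2.5_flag10_ungreen))
```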

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, rfpred, ci=TRUE)
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = rfpred,     ci = TRUE)
## 
## Data: rfpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.6966
## 95% CI: 0.6618-0.7314 (DeLong)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, glmpred, ci=TRUE)
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = glmpred,     ci = TRUE)
## 
## Data: glmpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.7062
## 95% CI: 0.6719-0.7405 (DeLong)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, selectpred, ci=TRUE)
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = selectpred,     ci = TRUE)
## 
## Data: selectpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.7157
## 95% CI: 0.6821-0.7494 (DeLong)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, svmpred, ci=TRUE)
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = svmpred,     ci = TRUE)
## 
## Data: svmpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.706
## 95% CI: 0.6713-0.7407 (DeLong)
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$svmpred, col="yellow")
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$rfpred, col="darkgreen", add=TRUE)
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$glmpred, col="blue", add=TRUE)
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$selectpred, col="orange", add=TRUE)
## Setting levels: control = 0-10%, case = >10%
## Setting direction: controls < cases
legend("bottomright"
       ,legend = c("SVM (Radial)","Random Forest", "Logistic: All Vars", "Logistic: Select Vars")
       ,col = c("yellow","darkgreen", "blue", "orange")
       ,lwd = 1)
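Given how close the four AUCs are, a formal comparison may be informative; pROC's roc.test implements DeLong's test for paired ROC curves (a sketch comparing the random forest and all-variables logistic model):

```r
# DeLong test for a difference in AUC between two of the classifiers
roc.rf  <- roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$rfpred)
roc.glm <- roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$glmpred)
roc.test(roc.rf, roc.glm, method = "delong")
```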

head(AQI.ACS.preds)


#SVM
roc(obs.outputs, pred.outputs.svm, ci = TRUE)
plot.roc(titanic$survived, svm.pred.lived)
plot.roc(obs.outputs, pred.outputs.svm, ci = TRUE, col = "blue", add = TRUE)
legend("bottomright", legend = c("Training", "Cross-Validation"), col = c("black", "blue"), lwd = 1)

#Random Forest
roc(obs.outputs, pred.outputs.rf, ci = TRUE)
plot.roc(titanic$survived, rf.pred.lived)
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "darkgreen", add = TRUE)
legend("bottomright", legend = c("Training", "Cross-Validation"), col = c("black", "darkgreen"), lwd = 1)

#Lasso
roc(obs.outputs, pred.outputs.lasso, ci = TRUE)
plot.roc(titanic$survived, titanic.lasso.pred[ ,1])
plot.roc(obs.outputs, pred.outputs.lasso, ci = TRUE, col = "red", add = TRUE)
legend("bottomright", legend = c("Training", "Cross-Validation"), col = c("black", "red"), lwd = 1)

#Plot both cross-validation ROCs
plot.roc(obs.outputs, pred.outputs.svm, ci = TRUE, col = "blue") #CV of svm
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "darkgreen", add = TRUE) #CV of rf
plot.roc(obs.outputs, pred.outputs.lasso, ci = TRUE, col = "red", add = TRUE) #CV of lasso
legend("bottomright", legend = c("SVM Cross-Validation", "RF Cross-Validation", "Lasso Cross-Validation"), col = c("blue", "darkgreen", "red"), lwd = 2)



table(AQI.ACS.preds$glmpred,AQI.ACS.preds$pm2.5_flag10_ungreen)


  pred.status.glm[1:length(s[s == i]) + offset] <- glm.pred.test
  
  # Random forest
  rf <- train %>% select(-status) %>% randomForest(statusfactor ~ ., data = ., ntree = 100)
  rf.pred.test <- predict(rf, newdata = test, type = "prob")
  pred.status.rf[1:length(s[s == i]) + offset] <- rf.pred.test[ , 2]

  offset <- offset + length(s[s == i])
}



rf.flag10.ungreen <- randomForest(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,ntree = 2000
             ,importance = TRUE)

rf.flag10.ungreen
rf.flag10.ungreen$importance[,c("MeanDecreaseGini")] %>% sort(decreasing = TRUE)
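As a sketch, the importance ranking above can also be shown graphically with randomForest's built-in `varImpPlot` (assuming the `rf.flag10.ungreen` fit above; the title string is illustrative):

```r
# Dotchart of variable importance for the fitted forest
varImpPlot(rf.flag10.ungreen, main = "RF importance: PM2.5 flag >10%")
```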


rf.flag20.ungreen <- randomForest(pm2.5_flag20_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,ntree = 2000
             ,importance = TRUE)

rf.flag20.ungreen
rf.flag20.ungreen$importance[,c("MeanDecreaseGini")] %>% sort(decreasing = TRUE)


AQI.ACS.svm <- svm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = AQI.ACS.geom
             , scale = TRUE
             # , kernel = "linear"
             , kernel = "radial")

# Linear kernel:
#         0-10% >10%
#   0-10%   192  180
#   >10%    111  426

# Radial kernel: Better! 
#         0-10% >10%
#   0-10%   193  179
#   >10%     46  491
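For reference, the overall accuracy implied by the two confusion matrices above (rows = observed, columns = predicted) can be computed directly; the matrices below simply transcribe the counts shown in the comments:

```r
# Confusion matrices transcribed from the comments above
linear <- matrix(c(192, 180, 111, 426), nrow = 2, byrow = TRUE)
radial <- matrix(c(193, 179,  46, 491), nrow = 2, byrow = TRUE)

sum(diag(linear)) / sum(linear)  # ~0.68
sum(diag(radial)) / sum(radial)  # ~0.75
```

The radial kernel's gain comes almost entirely from the &gt;10% class (491 vs. 426 correct), with the 0-10% row nearly unchanged.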

# Scatter colored by the outcome flag (note: data must be the data frame, not the svm fit)
plot(worked.12mo ~ ethn.nhisp.white.alone, data = AQI.ACS.geom,
     col = AQI.ACS.geom$pm2.5_flag10_ungreen)

summary(AQI.ACS.svm)
svm.flag10pred <- fitted(AQI.ACS.svm)
table(AQI.ACS.geom$pm2.5_flag10_ungreen, svm.flag10pred)

length(svm.flag10pred)
length(AQI.ACS.geom$pm2.5_flag10_ungreen)

# plot.svm takes the fitted model, the data, and a formula naming the two plot axes
plot(AQI.ACS.svm, AQI.ACS.geom, worked.12mo ~ ethn.nhisp.white.alone)





Retrieve a range of descriptive census data by census tract, then join it with the air quality summary data.
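The retrieval-and-join step described above could be sketched with the tidycensus package (a sketch, not the project's actual code: it assumes a Census API key is installed, and `AQI.summary` with a tract `GEOID` column is a hypothetical stand-in for the air quality summary object; the two ACS variable codes are real table IDs for median age and median household income):

```r
library(tidycensus)
library(dplyr)

# Pull tract-level ACS 5-year estimates for 2019 (illustrative state and variables)
acs.tract <- get_acs(geography = "tract",
                     state     = "PA",
                     year      = 2019,
                     survey    = "acs5",
                     variables = c(age.median       = "B01002_001",
                                   income.hh.median = "B19013_001"),
                     output    = "wide")

# Join tract characteristics onto the air quality summary by 11-digit tract GEOID
AQI.ACS <- AQI.summary %>%
  left_join(acs.tract, by = "GEOID")
```

A `left_join` keeps every monitoring site, leaving ACS columns as `NA` where a tract match is missing, which makes join failures easy to audit.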

# still to come